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The Poisson-Dirichlet Process 



Abstract 

The two parameter Poisson-Dirichlet Process (PDP), a generaUsation of 
the Dirichlet Process, is increasingly being used for probabilistic modelling in 
discrete areas such as language technology, bioinformatics, and image analysis. 
There is a rich literature about the PDP and its derivative distributions such 
as the Chinese Restaurant Process. This article reviews some of the basic 
theory and then the major results needed for Bayesian modelling of discrete 
problems including details of priors, posteriors and computation. 

The PDP is a generalisation of the Dirichlet distribution that allows one 
to build distributions over partitions, both finite and countably infinite. The 
PDP has two other remarkable properties: first it is partially conjugate to 
itself, which allows one to build hierarchies of PDPs, and second using a 
marginalised relative the Chinese Restaurant Process (CRP), one gets frag- 
mentation and clustering properties that lets one layer partitions to build 
trees. This article presents the basic theory for understanding the notion 
of partitions and distributions over them, the PDP and the CRP, and the 
important properties of conjugacy, fragmentation and clustering, as well as 
some key related properties such as consistency and convergence. This article 
also presents a Bayesian interpretation of the Poisson-Dirichlet process: it is 
based on an improper and infinite dimensional Dirichlet distribution. This 
interpretation requires technicalities of priors, posteriors and Hilbert spaces, 
but conceptually, this means we can understand the process as just another 
Dirichlet and thus all its sampling properties emerge naturally. 

The theory of PDPs is usually presented for continuous distributions (more 
generally referred to as non-atomic distributions), however, when applied to 
discrete distributions its remarkable conjugacy property emerges. This con- 
text and basic results are also presented, as well as techniques for computing 
the second order Stirling numbers that occur in the posteriors for discrete 
distributions. 
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4 The Poisson-Dirichlet Process 

1 Introduction 

The two-parameter Poisson-Dirichlet process (PDP), also known as the Pitman- Yor 
process (named so in |IJ01j ). is an extension of the Dirichlet process (DP). Related 
is a marginalisation known as the Chinese Restaurant Process (CRP) which gives 
an elegant analogy of incremental sampling of partitions. These models have proven 
useful in a number of ways as tools for non-parametric and hierarchical Bayesian 
modelling, especially in discrete domains such as with language and images where 
one wants to develop hierarchical models, or be flexible with dimension. 

In language domains, PDPs are proving useful for full probability modelling of 
various phenomena including n-gram modelling and smoothing |Teh06b[ IGGJ061 



IMS08] . dependency models for grammar |JGG07i IWSM08J . and for data compres- 
sion |WAG"'"09j . The PDP-based n-gram models correspond well to versions of 
Kneser-Ney smoothing [TehOGbj . the state of the art method in applications. These 
models are intriguing from the probability perspective, as well as sometimes being 
competitive in terms of performance. More generally, the models are also being 
used for clustering |GROH IRasOO] , and for related tasks such as image segmentation 
|SJ09j . relational modelling |XTYK06] . and exemplar-based clustering |TZF08j . 

PDPs and their associated distributions are basically a tool for modelling two 
kinds of objects: mixture models and partitions. They can then be used to model 
trees and other hierarchical structures. Section [2] introduces the interrelated con- 
cepts of a simple mixture model and a partition along with some of their statistical 
complications. The definition of a PDP is then presented in Section |3j The theory 
is well developed for the more general context of continuous distributions, and that 
theory is reviewed here. Details of statistics, consistency and the forms of posteriors 
are given in Section |4j Statistical analysis of partitions is given in Section |5] which 
arises when one marginalises the posterior of the PDP to obtain the CRP, which 
can be further marginalised to obtain a distribution on partition sizes. Results on 
fragmentation and grouping of partitions in Section [6] allow one to extend the CRP 
distribution to trees {i.e., nested partitions). 

A new Bayesian interpretation and definition of the PDP is then given in Sec- 
tion [7j This uses the methodology of improper priors too show that the distribution 
on the infinite probability vector underlying the PDP is in fact an infinite improper 
Dirichlet. From this one can readily obtain all the standard sampling and additivity 
properties for the PDP. 

With the use of PDPs increasing in computer science applications, where sophis- 
ticated discrete probabilistic modelling is required, this article reviews and sum- 
marises the basic theory of PDPs in the discrete context in Section [8] This context 
has distinct properties where the distribution on partition size, introduced above, 
is needed to express posteriors using Stirling numbers of the second kind. These 
play the role of the Gamma function that occurs in the posterior for a Dirichlet 
distribution. Basic conjugacy results for the PDP are then presented that make it 
such an important distribution for hierarchical Bayesian reasoning. 
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We have attempted where possible to follow conventions used in the mathemat- 
ical statistics community, and to provide some pointers to that literature. 

2 Mixture Models and Partitions 

Before introducing the PDP, we introduce the basic context of its use, a simple 
mixture model. The kinds of mixture models we consider require as input a base 
probability distribution H{-) on a measurable space X, and yield a discrete distribu- 
tion on a finite or countably infinite subset of X. This means the output distribution 
is a weighted set of impulses at points in X. 

Definition 1 (Impulse mixture model) Given a probability distribution H{-) on 
a measurable space X , assume the values X^ G X for k = 1, ..., cxd are independently 
and identically distributed according to H{-). Also an infinite dimensional probability 
vector p is sampled from a distribution Q{-) independently of each X^ so < pk < 1 
and Y^kLiPk = 1- Then 

oo 
k=l 

is an impulse mixture model with base distribution H{-) and probability distribution 
Q{-). Note Sx*{-) is a discrete measure concentrated at X^. 

An example of an impulse mixture model is given in Figure [T] where the base dis- 
tribution is a Gaussian and the probability vector p was generated by a so-called 
stick-breaking method (introduced later). 

These kinds of models have a long history and appear in various forms. In jPit96j 
the species sampling model is developed in the context of the PDP and attributed 
to R.A. Fisher for simple models of animal species. An alternative derivation takes 
the view that the p are initially unnormalised, so-called random measures, which 
one then normalises |JLP09j . There are various schemes for sampling the proba- 
bility vector p including normalized random measures |JLP09] and general species 
sampling schemes |IJ03j . The Poisson-Dirichlet distribution we present here is a 
particular version that is partially conjugate and thus admits convenient Bayesian 
computation. 

When the base distribution is continuous, such as with the Gaussian, one can 
see that almost surely no two values X^ and Xf (for k ^ I) would be the same. 
The general property is described as non-atomic, which means H{X) = for all 
X & X, then samples from H{-) are almost surely distinct. The counter property is 
where the distribution is discrete, so H{X) > for a\\ X & X and thus it is always 
possible that X^ = X* for some / 7^ k. This distinction, non-atomic versus discrete, 
has important consequences for posterior analysis as explained in Section [8j Mixed 
base distributions are not considered. 

When sampling from an impulse mixture model, one would observe a sequence of 
A^ data values Xi,X2,X3, ...,Xn, and one may also assume corresponding indices. 
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Figure 1: Gaussian base distribution H{-) and a resultant impulse mixture model 
G{-). The infinite number of smaller impulses do not show. 



the k in Formula ([I|, also exist, so these might be fci, A;2, ^3, •••, ^Af- Since the actual 
indices are latent and not part of the observed data for the model of Formula rtl]), the 
latent indices can be arbitrarily relabelled and converted to a normal form. Such 
a normal form where the labels are irrelevant is called a partition: it defines the 
grouping of items in the sequence, not the actual indices assigned. 

Definition 2 (Partition) A partition P of a countable set X is a mutually ex- 
clusive and exhaustive set of subsets of X. The partition size of P is given by the 
number of sets \P\. 

In our case, we make a partition of the data sequence {Xi,X2,X3, ...,X7v}. Par- 
titions are important because, not only do they provide the assignments in a mix- 
ture model, they can also be used as a primitive in the generation of many data 
structures. For instance, random trees can be created using partitions either by a 
top-down process of fragmentation, illustrated in Figure [2| or by a bottom-up pro- 
cess of coagulation, illustrated in Figure [3] When storing and sampling partitions, 
however, one may need to order them. Most importantly, when sampling infinite 
partitions or infinite probability vectors p, one needs to order them roughly by size: 
there is no point in generating a million infinitesimal probabilities before generating 
one with significant size. 

One obvious ordering condition to try is by size. One could make the probability 
vector p ordered so that Pk+i < Pk for all k, or order bins in the sampled partition 
by their decreasing number of entries. From a statistical perspective, however, this 
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d a,b,c,d,e,f,g,h,i,j,k,l,m,n,o J x> 
Z^^ — 7 V, " 

Figure 2: Shows a partition of a set of letters {a, 6, ...,o}, illustrating a. fragmentation. 
The top node is the full set and the bottom row is the partition displayed in order 
of least elements, i.e. a,b,c,g. 



\ a,d,k,n I,. ivb,f,h_2i,W.„ i„ c,e,l,m,o ; 1 g; 

d' 'q' ff Q 0' S ' 5 ff K "S^^^^^ 

Figure 3: Shows a partition applied to a set of sets of letters {{a},{b}, ...,{o}}, 
illustrating a coagulation. The bottom set of nodes lists the sets, and the top row 
applies the partition to the set of sets to get subsets of sets which are then unioned. 
The top row is displayed in order of least elements. 



is impractical because one does not know the value of each pk, and the sizes of the 
bins will vary during sampling. 

The size-biased order and the order of least elements are two ways of ordering 
sets that order partitions by their order of first occurrence in a sequence. The 
sets in the partition can then be enumerated according to this order. For instance, 
assuming the data sequence is 'a', 'b', 'c', ..., then the top partition in Figure [3] is 
listed left to right in order of least elements. We define slightly different versions of 
these depending in the representation of the the set being ordered. 

Definition 3 (Orders for sets) Size-biased orders are defined for index se- 
quences, probability vectors and partitions: 

• An index sequence / of length N given by ki,kn, ...,kN is in size-biased order 
if ki = 1 and fc„ < 1 + maxi<j<„ ki for n = 2, ..., A^. 

• An infinite probability vector p is in size-biased order if the elements pk have 
been reordered to be in their order of first occurrence in a random sample from 
p. 

• A partition P is in order of least elements if members are listed in order of 
their least element. 

Note these three definitions are related as follows: if we take an infinite random 
sample from a probability vector p, and then renumber the index sequence so they 
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Pi 



P2 



^^ fP-yi)(l-y2) 



Figure 4: The stick-breaking analogy for a GEM. Left shows the tree of probabihties 
being generated. Middle shows the stick lengths, with pi and p2 already broken off, 
and the remainder, (1 — Vi)(l — V2), about to be broken. 



are in size-biased order, then the corresponding renumbering converts the probability 
vector p to size-biased order. If we take an infinite random sample from an impulse 
mixture model with probability vector p and non-atomic base distribution over X, 
partition the sample according the values from X, and then label the entries by the 
order of occurrence, 1, 2, 3, ..., then the order of least elements of the partition also 
yields a size-biased order of the probability vector p. 

A normal form for index sequences can be derived by renumbering indices so that 
the sequence is in size-biased order: indices are renumbered to 1,2,3,... according 
to their first occurrence in the sequence. So a size-biased ordered renumbering of 
the indices 12, 435, 7198, 12, 12, 35, 7198 is 1, 2, 3, 1, 1, 4, 3. Also, a partition of a data 
sequence can be represented by a size-biased order of the indices. The top partition 
in Figure [3} for instance, can be represented by the sequence "12313242221 
3 3 1 3". 



3 The Poisson-Dirichlet Process 

For an impulse mixture model, we not only need a base distribution H{-) but also 
need to specify the probability vector p. Within the PDP literature, p follows a 
two parameter Poisson-Dirichlet distribution |PY97j when the probabilities pk are 
ordered by decreasing size, or equivalently by the Grifiiths-Engen-McCloskey (GEM) 
distribution [Pit06] when the probabilities p have a size-biased order. 

The GEM distribution is defined via the so-called "stick-breaking" model which 
goes as follows: 

1. We take a stick of length one and randomly break it into two parts with 
proportions Vi and 1 — Vi. The first broken stick has length Vi. 
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2. We then take the remaining part, of length 1 — Vi and apply the same process 
to randomly break into proportions V2 and 1 — V2. This second broken stick 
is the first part, of length (1 — V^i)V2. 

3. Again, we take the remaining part, of length (1 — Vl)(1 — V2) and apply the 
same process to randomly partition into proportions V3 and 1 — V3. This third 
broken stick is the first part, of length (1 — Vi)(l — V2)V3. 



Formally, this goes as follows: 

Definition 4 (GEM distribution) For < a < 1 and b > —a, suppose that 
independent random variables Vk are such that Vk has Beta{l—a, b+ka) distribution. 
Let 

Pi = V^i, Pk = {l-Vi)---{l-Vu-i)Vk k>2. 

Define the Griffiths-Engen-McCloskey distribution with parameters a, b, abbreviated 
GEM{a,b) to be the resultant distribution of {pi,p2, ■■■)■ 

Definition 5 (Poisson-Dirichlet distribution) Let (pi,p2,...) ~ GEM{a,b) 
and define p= {pi,p2, ...) to be their sorted values so that Pi > P2 > ■ ■ ■■ Then p fol- 
lows the Poisson-Dirichlet distribution with parameters a,b, abbreviated PDD{a,b). 

Here the parameter a is usually called the discount parameter in the literature, and 
b is called the concentration parameter. The term concentration when used in the 
statistics community usually means a quantity that behaves like the inverse of a 
variance. 

Why do we need two definitions, a PDD and a GEM with the same parameters 
modelling different orderings of the same distribution? 

• When estimating the probability p for the mixture model of Formula ([I]) 
and using the GEM distribution, the sorting is useful for sampling efficiency 
|KWT07] ■ One acts like one is using a PDD. 

• The PDD gives a canonical form for the distribution. Different sorts of the 
one underlying distribution can be generated with a GEM. 

• The GEM has a convenient sampling form (the stick-breaking model) that 
allows for simpler analysis. 

• When used inside a mixture model or partition model, they are indistinguish- 
able, so we use which ever is convenient. 



Samples of p ~ PDD(a, b) for different parameter settings are given in Figure pi 
showing both probabilities and log scale probabilities. One can see that for the 
initial values (the first 20 say), the effects of the discount a are not that great. 
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PDD(O.O.b) 



PDD(0.0,b) 




10 10 10' 

log rank 



Figure 5: The PDD with different parameter settings. Each hne of each plot shows 
•p ~ PDD(a,6) for a fixed discount a and a choice of concentrations h. The left 
plots are log-log scale, showing hundreds of values. The right plots are normal scale 
showing just 25 values (subsequent ones are effectively zero). 
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whereas the concentration b effectively changes the onset of the decrease. We show 
later that vector values for the PDD with discount a > are Zipfian, behaving 
roughly like pk oc A;"^/", which makes the slope on the log-log plots — 1/a. Whereas 
for the case with discount a = the values are geometric, behaving roughly like 
Pk oc e~^^''. 

A common definition of a Poisson-Dirichlet process is that it extends the Poisson- 
Dirichlet (or GEM) distribution. This definition presents the PDP as a functional on 
distributions: it takes as input a measurable space with domain X, and a distribution 
over it called the base distribution, usually represented here as H{-), and yields as 
output a discrete distribution with a finite or countable set of possible values on the 
domain X. 

Definition 6 (The Poisson-Dirichlet Process) Let H{-) be a distribution over 
some measurable space X . For < a < 1 and b > —a, suppose that p is drawn from 
a Poisson-Dirichlet (or GEM) distribution with parameters a, b. Moreover, let X^ 
for k = 1,2, ... be a sequence of independent samples drawn according to H{-) and 
independent of p. Then p and X^ for k = 1,2, ... define a discrete distribution on 
X given by the formula 

oo 

^PkSxii-) . (2) 

fc=i 

This distribution is a Poisson-Dirichlet Process with parameters a, b and base dis- 
tribution H{-), denoted PDP{a,b,H{-)). 

The Dirichlet Process (DP) is the special case where a = 0, and has some quite 
distinct properties as shown later. Note strict requirements on the sorting or ordering 
as in Definition [5] are not needed in the definition of the PDP due to the effect of 
the summation. 

The PDP is also called a stochastic process because it can be defined as a sequence 
of values Xi,X2,... G X from some base probability distribution H{-) indexed by 
integer valued time as 1,2,3,.... The stochastic process is the sequential sample 
from this PDP distribution. The conditional distribution with p marginalised out 
for this, as long as H{-) is non-atomic, is as follows: 

p{X^^^\X„...,X^,a,b,H{.)) = ^^ /^(O + E y^^xj-) , (3) 

where there are M distinct values in the sequence Xi, ...,Xn denoted by X^, ..., Xl^ 
and their occurrence counts respectively are ni, ..., um, so I]^=i ^m = N. 

The Chinese restaurant analogy for this sequential sampling process goes as 
follows: 

• A customer walks into the restaurant and sees M occupied tables where n^ 
others sit at table m enjoying the menu item X^. 
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• He can start his own table with probabihty ^^^" and receive a new item Xl^_^_i 
from menu H{-) by samphng. 

• Otherwise, he goes to one of the existing M tables with probability ^'"^" and 
enjoys the item X^. 

The Chinese Restaurant Process (CRP) is defined over the partition so generated, 
represented with a size-biased order of indices. 

Definition 7 (Chinese Restaurant Process) For < a < 1 and b > —a, gen- 
erate a sequence of integers ki,k2, ■■■ as follows: 

b + Ma ^ , ^ n 

m=l 

where M and the counts ni, n2, ...,nM are derived as for Equation 



p{kN+i\ki,...,kN,a,b) = ^ 6k^^^=M+i + I] ^"^ ^ '^fciv+i=m , 



If indices are sampled according to the CRP then one gets a size-biased ordered 
index sequence from a PDD(a,6) |Pit95t IIJOlj . and thus the CRP can serve as an 
alternative sampler for the PDP where the probability vector p is unknown. 



4 Basic Properties 

Before getting onto discrete domains, we review basic properties of the PDD, and 
of the PDP in non-atomic domains. Some of these results will be used subsequently 
to address discrete domains. 

For the sample from the distribution of Formula (|2|, a latent sequence of indices 
exists In '■= (^i, •••, ^at), however, these remain hidden (only the corresponding 
data values are known, not the indexes). For a non-atomic base distribution the 
indices are irrelevant and we can renumber them by size-biased ordering. Each 
index corresponds to a table in the CRP, and the number of distinct indices in the 
sample is the number of tables active at the restaurant. 

Our notation for statistics is as follows: 

Definition 8 (Index Statistics) When sampling independently and identically 
from a discrete distribution in the form of Formula M), one gets a sequence of 
latent indices, an index sequence of length N given by I^ = ki,k2, .■.,kj^. In Ijy 
one index value k can occur multiple times. Sort and count the N points of Ijq. 
Suppose there are M distinct values in In, k^ for m = 1, ..., M that occur n^ times 
respectively, so Y,m=i''^m = N. Call M the partition size and note it depends on 
the sample In and the sample size N. Moreover, use a size-biased ordering of the 
indices and renumber them according to this. The corresponding size-biased ordered 
index sequence is denoted I^. 
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Consider a sample with N = 7 points with latent indices In = 
12,435,7198,12,12,35,7198. Then the size-biased ordered renumbering is I^ = 
1,2,3,1,1,4,3. Thus the partition size M = 4 and occurrence counts rii = 3, 
n2 = 1, ^3 = 2 and n^ = 1. Note the partition size M for a sequence corresponds to 
the number of active tables in the CRP terminology. 

Definition 9 (Data Statistics) When sampling from the discrete distribution of 
Formula M) or via a PDF, one gets a data sequence of length N given by Sjy = 
Xi,X2, ...,Xi\i. Sort and count the N points of Sn- Suppose there are M distinct 
values in Sn, X^ for m = 1, ..., M that occur rim times respectively, so Y.m=i ^m = 
A^. For non-atomic base distributions H{-), it is safe to associate index m with data 
value X^, which is the unique size-biased ordered renumbering. The corresponding 
index sequence is denoted I"^. 

For discrete base distributions H{-), a unique size-biased ordered renumbering for 
the indices does not exist because if two data items in a sample are equal, one cannot 
be certain their latent indices are the same. 

4.1 The Dirichlet Process 

The Dirichlet Process (DP) is a special case of the PDP when the discount parameter 
a = 0. It has quite distinct properties as subsequent analysis will show. It is 
usually defined in a completely different manner to the PDP as follows. Let X be 
a measurable space. For a random probability distribution G(-) to be distributed 
according to a DP, its marginal distributions have to be Dirichlet distributions too. 
Ferguson |Fer73] gave a formal definition of the DP as follows. 

Definition 10 (Dirichlet Process) Let H{-) be a random measure on X and b > 
be positive real number. We say a random probability measure G{-) on X is 
a Dirichlet process with a base measure H{-) and concentration parameter b, i.e. 
G{-) ~ DF{b, H{-)), if for any finite measurable partition {Bi, B2, . . . , Bk) of X , the 
random vector {G{Bi),G{B2), . . . ,G{Bk)) is Dirichlet distributed with parameter 
ibHiB,),bHiB2),...,bHiB,)): 

(G(fii), G{B2), . . . , G{B,)) ~ Dtrtchlet{bH{B,), bH{B2), . . . , bH{Bk)) . 

The DP is an extension of a Dirichlet distribution, which is defined for a finite set. 
The following simple corollary of the definition above demonstrates this. 



Lemma 11 According to Definition 10, if H(-) is a categorical distribution over a 



finite space, represented by the probability vector 9 say, then the following holds 

DF{b, H{-)) = Dirichlet{be) . 

This is important because posterior analysis of hierarchical Dirichlets is intrinsically 
difficult, so posterior analysis of hierarchical DPs can help. 
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4.2 Consistency results 

The PDD can be used to approximate a broader class of distributions, not just 
those sampled from a PDD(a, b). The following lemma derived from James |Jam08l 
Proposition 2.2] shows this. This supposes a "true" probability vector q gives a 
distribution of integers and then shows a sufficient property required of q so that a 
PDD distribution can approximate it based on samples. 

Lemma 12 Suppose an integer sequence I of length N is sampled independently 
and identically according to the probabilities q where < q^ < 1 for k = 1, 2, ... and 
X^fcLi Q'fc = 1 o-nd use the notation of Definitions^ If it is assumed the q is PDD(a, b) 
for < a < 1 and b > —a, then the posterior distribution on q given I converges 
weakly to q if ^i\^^n [M/N] -^ as N -^ oo where M is the partition size defined 
in Definition^ 

Basically, we have some "true" model over samples given by the probability vector 
q. From this we compute the expected partition size ^i\^^n [M] for sample sequences 
/ of size N, and then check this grows slower than N as N ^ oo. If this holds for 
q, then the distribution q can be learnt using Bayesian methods that assume q is 
PDD(a,6). We show later that if g ~ PDD(0,6), then almost surely lE/ig-^Ar [M] is 
O(logA^) and if g ~ PDD(a,6) for a > 0, then almost surely JEi\^^n [M] is 0(7^). 

As a warning more than anything else, it is important to realise the PDP should 
not be used to approximate continuous distributions. This is made precise by the 
consistency result due to James |Jam08t Proposition 2.1]. 

Lemma 13 (PDP posterior convergence) Suppose data is sampled indepen- 
dently and identically from a Polish space X according to a continuous distribution 
Po('); ^'^'^ ^^t ^i') ^^ another distribution on X where H{-) is non-atomic. Then 
the posterior of the PDP{a,b,H{-)) distribution given the sample converges weakly 
to point mass at the distribution 

aH{-) + {l-a)Po{-) 

Hence the posterior is consistent only if either H{-) = Po(") or a = 0. 

Note discrete distributions cannot be continuous since they have finite mass concen- 
trated at points. Thus the above lemma does not apply to the discrete case. When 
the "true" distribution Po(') is discrete, weak converge does hold. 

4.3 Posteriors 

One can derive the probability of evidence or data given the model, a useful di- 
agnostic in Bayesian analysis. Various versions of this are well known, see |PY97l 
Appendix] and [Pit95> Proposition 9] , and easily proven by induction using the CRP. 
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Lemma 14 (Probability of evidence) Consider finite samples Sn = 
Xi,X2, ...jXn from PDP{a,b,H{-)), where the base distribution H{-) is non- 
atomic. Use the notation of Definition^ Then the probability of evidence given the 
model PDP{a,b,H{-)) is 



p{x,,x,,...,x^\aM = ^tS^ n ^K) n(i - «)"..-! ' 



IM 



where {x)n denotes the Pochhammer symbol x{x-\-l)...{x + N — 1) = r{x + N)/r{x) 
and {x\y)i\f denotes x{x-\-y)...{x-\-{N — l)y), the Pochhammer symbol with increment 
y, and {x\0)n^ = x^ . 



5 Probabilities of Partitions 



One can easily see the posterior in Lemma 14, with the term for the base distribu- 
tion (nm=i-^(^rn)) rcmoved represents a distribution on a partition. This section 
presents various properties of this distribution. 



5.1 Chinese Restaurant Distribution 

For a partition represented as a size-biased order I"^, the probability is a function 
of the partition size M and the occurrence counts rii, ...,nM, where by default the 
counts are listed in size-biased order. The resultant probability on I"^ is then a neat 
function f{ni, ...^um)- This is called an exchangeable partition probability function. 
This goes under various names in the literature, so the term Chinese Restaurant 
Distribution is used here to differentiate it from the CRP from which it is derived. 

Definition 15 (Chinese Restaurant Distribution) Civen a set P of size N = 
\P\, represent the partitions of P by the set of size-biased orderings of length N, 
where one is denoted I^. Define 




p{r^\a,b) 



where M and the occurrence counts ni,...,nM follow Definition \^ Call this the 
Chinese Restaurant Distribution with parameters {a,b), abbreviated CRD{P,a,b), 
and note its samples are a partition of P. 

Note this distribution runs over all possible partitions, so for instance if A^ = 3 the 
possible partitions of {a, b, c} represented using an ordering of least elements are 
given in Table [T] 
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partition 


{{a,b,c}} 


{{a,b},{c}} 


{{a,c},{b}} 


{{a},{6,c}} 


{w,m,{c}} 


partition Jjl^ 
size M 
counts n 


(1,1,1) 
1 

(3) 


(1,1,2) 
2 

(2,1) 


(1,2,1) 
2 

(2,1) 


(1,2,2) 

2 

(1,2) 


(1,2,3) 
3 

(1,1,1) 


p{r^ a>0,b) 
p{r^ a = 0,6) 


{l-a)(2-a) 
(6+l)(6+2) 

2 
(b+l)(fe+2) 


(b+a){l-a) 

(fe+l)(fe+2) 

b 


(6+a){l-a) 

(6+l)(fe+2) 

6 


(6+a)(l_a) 

(fe+l){6+2) 

b 


{b+a)(b+2a) 
(6+1)^6+2) 


(b+l)(fe+2) 


(b+l)(fe+2) 


(6+l)(fe+2) 


(b+l)(b+2) 



Table 1: Space of partitions over {a, b, c}. 



5.2 Partition size 

The key characteristic of a sample from the PDD or a CRD is the partition size M 
from Definition |8j This is related to the expected posterior probability of starting a 
new bin (for the PDD or CRP) or a new data value from X in the non-atomic case 
of the PDP. This is given by the formula for the unseen part of the CRP, 



p{kN+i ^ In I In, M, a, b) = p{Xn+i ^ Sn \ Sn, M, a, b) 



b + Ma 

N + b 



The posterior distribution for the partition size given just the sample size introduces 
a significant function, S^^, which is a generalised Stirling number. It was applied 
to the task by Pitman |Pit99[ Equation (89)] where it was represented as a{N, M, a) 
and by Teh |Teh06a] in the form Sa{N, M ), as a generalised Stirling number of type 
(— 1,— a,0) attributed to Hsu and Shiue, where it was applied to the analysis of 
hierarchical PDPs. The case for a = was first presented by Antoniak |Ant74l 
pll61]. 



Lemma 16 (Probability on partition size) Consider the size-biased ordering 
of indices I^ for a sample of size N from a PDD with parameters (a, b). The proba- 
bility distribution for M given just N and integrating over all possible partitions I^ 
of size M is 



p{M\N,a,b) = M^s"" 



(b) 



N 



M,a^ 



where 



(4) 



'~'M,a 



N\ 



E 



J2"n^=N, n„>l"^=l 




for M <N and else. 



1 — a" 



"m-l 



n^ 



Ur 



N-ET=in 



(5) 



The following expressions are useful for computing S 



N 
M,a- 



Theorem 17 (Expressions for S^j ^) 
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(i) Linear recursion: Sj^.f'^ = Sj^,j_i ^ + {N — Ma)Sj^,[ ^^ 
Boundary cond. : Sfj „ = for M > N, Sq^ = 6 



N,0- 



N-M+m /jv\ N-M+1 

(n)Mult.recurstons:SZ,a= E W'^.aS^M--l^,a = H (V^ SlaSZ~-la 



n=m ^™' n=l 



S^^^ = T{N -a)/T{l-a). Any Q < m< M . 

(iii)Explicit expression: S^^^ = ^^ ^ («)(-)" IT {h-am) for a > 



qN 
^M,0 



^^^ ■ " m=0 /i=0 

(-)^^-i d^'-' / T{N-ay "=° 
(M-1)! ^a^-^-i I r(l-a) 



fiv) Asymptotic expr.: SZ n — -^^, r ^ , , ^, — rr-r^rr — for a > 

y / y y y M,a r(l-a) r(M)a^-i N^ ■' 

(v) Expr. for a = 0; 5*^0 = kkr I = unsigned Stirling^ of 1st kind \AS74^ 

The asymptotic expression holds for N ^ oo and fixed M and a. 

The explicit closed form (Hi) was developed in |Tos39j and presented in |Pit06] . 
This explicit form becomes unstable for large values of M since it is effectively an 
M-point interpolation to a partial derivative, thus has a lot of differences of similar 
values. It remains effective for small M. Some examples are: 



qN 
^2,0 



S^,o (UN) - ^o(l)) ^J^o = 2^S. {Mn) - Ml) + (MN) - Mi)f 

qN _ I f qN qN \ qN _ I f qN r^qN j_ qN 



Figure |6] illustrates the shape of the distributions and their location for different 
values of a and b and fixed A^ = 1000. Similar looking plots are produced when 
A^ = 10000. Note the distribution does reflect a Poisson in some ways, being skewed 
both at the lower boundary M = and the upper boundary M = N, and being 
fairly symmetric in other cases. Figure [7] illustrates the shape of the distributions 
and their location for different values of a as A^ grows, for b = 50. The figure for 
a = 0.9 has a different horizontal scale. Note also how the spread of M increases as 
the sample size N increases. 

5.3 Convergence results 

It is well known that expected partition size for the DP (PDP with a = 0) is 0(log A^) 
and for the PDP it is 0{N°'). Here the exact rates are presented along with their 
expected variance |YS00j . Further details of moments for the PDD are also given 
by Ishwaran and James |IJ01j . 

Lemma 18 (Expected partition size) In the context of Definition [^ if a par- 
tition sequence Jj^ has the probability vector p distributed a priori according to 
PDD{a, b), the expected a posteriori M for a sample of size N denoted ^^a,b,N [M] 
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Figure 6: Posterior probability on M given A^ = 1000 and different a. 
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Figure 7: Posterior probability on M for increasing A^ and fixed b = 50. 



(and note the actual sample is unknown here, just its size N is known), when a > 
is given by 



^p[a,b,N [M] 



b{b + a) 



N 



a {b)]\f a 

= b{Mb + N)-Mb)) + 0{ab\og^N) 
b f NY ( aN \ b 



for N,b:^ a 



where {x)j\f denotes the Pochhammer symbol x{x + l)...{x + N — 1) = r{x + N)/r{x). 
The a posteriori variance of M for a sample of size N, denoted Var^a,b,N [M], when 
a > is given by 



Varpia,b,N [M] 



b{a + b){b + 2a) M b{b + a)N ( b {b + a) 



~ a'v' + y; '"^K& + iV), 



In the context where a = 0, 



N 



a^ (6);v a {b)N \a {b)^ 

b /. A^\2" f aN \ 



for N,b:$> a . 
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~ 6 log f 1 + — j for A^, 6 > , 

Varp\,,b,N[M] = b{Mb + N)-Mb)) 

+ b\Mb + N)-Mb)) 
~ 6 log f 1 + — j for A^ > 6 > , 

where ipol') is the digamma function andipil-) is the 1-st order poly gamma function, 
the derivative of the digamma function. 

Thus for < a < 1 and h fixed, Wj^a,b,N [M] is almost surely sublinear in A^ as 
described in Section l42l 

Note Wjj^a,h,N [M] is roughly linear in h in all cases. For the DP case (when 
a = 0) and A^ ^ 6 ^ 0, the a posteriori standard deviation of M is approximately 
the square root of lEp^^f^Af [^]; so M is somewhat Poisson in its behaviour. For 
the PDD a > and A^ ^ h ^ a, the a posteriori standard deviation of M is 
approximately Wjp^a,b,N [M] j Jhja^ so is smaller than ^.^a^h.N \M\ for b ^ a. 

To compare convergence of PDD distributions with known series, we use the 
following lemma. 

Lemma 19 (Upper bound on expected partition size) Suppose a partition 
sequence Jj^ of length N is sampled independently and identically according to the 
probabilities q where < q^ < I for k = 1,2, ... and J^'kLi Qk = ^ o^nd use the notation 
of Definition^ If q takes the form of a geometric series, q^ = r''^^(l — r), then 

\ogN 1 + 21og 1/r + loglog 1/r 



^i',\AM] < 



log 1/r log 1/r 



// q takes the form of a Dirichlet series qk = k ^C{s) for s > 1 (where ({s) is the 
Riemann zeta function), then 

s f N V^" 

The bounds are often quite good. Experimental evaluation shows the geometric 
series bound is close to about 20% except where r approaches 1, and the Dirichlet 
series bound is close to about 20% except where s approaches 1. 



Comparing the expected partition sizes of Lemma 18 with the different conver- 
gent series above, one can see that the PDD case for a > behaves more like a 
Dirichlet series with exponent s = 1/a, whereas the DP case (for a = 0) behaves 
more like a geometric series with factor r = exp(— 1/6). 

5.4 Dirichlet-Multinomial models 

It is instuctive to compare the CRD with the Dirichlet-multinomial model obtained 
by marginalising out parameters from a multinomial posterior. This is a posterior 
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C L a,b,c,d,e,f,g,h,i,j,k,l,m,n,o Z ^ 
/ 



\a,d,k,n ) \hjhi}) i c,e,l,m,o ^/ l.g,,' 

M'" ig' * J^ \ Jf ^ 'Jill,. ^iL 

Figure 8: Shows two fragmentations of a set of letters {a,b, ...,o}. The top node is 
the full set, the middle row is after the first fragmentation, and the bottom row is 
after the second fragmentation. Each row is in size-biased order. 



on assignments of N data points to K classes, rather than a partition of N data 
points. So given A^ data in total assigned to K classes with counts n = (ni, ...^tik)-, 
and using a prior of Dirichleti^ (a): 

/^|,. X r(6) Y\kT{nk + ak) 

p(n\lS,a) = ; r ; r 

1'')^ k 

where b = J2k otk- Here the second line represents the formula using Pochhammer 



symbols to bring out the correspondence with Definition 15 Comparisons are as 
follows: 

• For the Dirichlet-multinomial, the number of classes K is fixed, for the CRD 
the size of the partition (M is sometimes used) is varied as well, and has the 



posterior given in Lemma 16 



• The n/c — 1 subscript in the Pochhammer symbol in Definition 15 loses 1 due 
to starting off the new bin in the partition. 

6 Fragmentation and Coagulation 

Fragmentation is the term used when a partition created by one distribution is fur- 
ther split using partitions created by a second distribution. In a simple finite case, 
the technique is illustrated in Figure [8j Fragmentation works as follows: every set in 
the partition is further partitioned. A complementary process is bottom-up, called 
coagulation, and is illustrated in Figure [9j Coagulation makes a partition of the set 
of sets forming the original partition, and then merges entries in the one bin. The use 
of the second partition here is a bit more indirect. So, for instance, at the bottom row 
of Figure|9| the initial set is {{a}, {6, /, j}, {c}, {d, fc}, {e,m}, {g}^ {h^i}., {/, o}, {n}}. 
Partition this set into 4 parts (1) {{a}, {d, A;}, {n}}, (2) {{&, /, j}, {/i, «}}, (3) 
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V.„.„. a,b,c,d,e,f,g,h,i,j,k,lm,n,o _„,„„> 
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Figure 9: Shows the coagulations reversing the fragmentations in Figure [8j The 
bottom row is the initial partition. The middle row is a coagulation of the bottom 
row, and the top row is a coagulation of the middle row. 



{{c}, {e,m}, {/, o}} and (4) {{(7}}, and then flatten these sets to get a partition 
of {a,(i, A;,n}}, {6, /, h,i,j}, {c, e, /,m, o}, and {g} as shown in the second row. 

Definitions below are given in terms of sets. Conversion to size-biased ordering 
is detailed but not difficult. 

Definition 20 (Fragmentation of partitions) Consider a partition P repre- 
sented as a set of sets {Pi,P2, ...,Pm}, and a sequence of partitions Qi,Q2, ■■■,Qm of 
the sets Pi, P2, ..., Pm respectively. Then the fragmentation ofP using Qi,Q2, ■■■, Qm 

Definition 21 (Coagulation of partitions) Consider a partition P represented 
as a set of sets {Pi,P2, ...,Pm}, and a second partition Q of {1,2, ...,M} where 
M = \P\. Then the coagulation of P using Q is {[JmeqPm '■ Q £ Q}- 

As seen in the figures, fragmentation and coagulation would seem to be complemen- 
tary in their way of changing a partition, one splits and one merges, but both driven 
by partition templates. 

6.1 Operations on the CRD 

In some cases, the two operations are also statistically complementary when applied 
to samples from the Chinese Restaurant distribution. This is usually defined in 
terms of fragmentation and coagulation functionals over distributions |Pit06j . For 
simplicity we present the operations directly in terms of Definitions 20 and 21 as 
follows: 



Theorem 22 (Fragmentation of CRDs) For 0<ai<l,0<a2<l and b > 
—aia2 and a set P, if a partition Q sampled from CRD{P,aia2,b) is fragmented 
with partitions sampled from CRD{Qm,ai,—aia2), for Qm G Q, then the resultant 
fragmented partition is distributed as CRD{P,ai,b). 
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Theorem 23 (Coagulation of CRDs) For < ai < 1, < a2 < I and b > 

—aia2 and a set P, let I ~ CRD{P,ai,b), the partition size of I is M, and J ~ 
CRD{I,a2,b/ai) then a coagulation of I with J is CRD{P,aia2,b). 

Thus one coagulates to convert a partition from CRD(P, ai,6) to be a partition 
from CRD (P, aia2, 6), and conversely, one fragments to convert a partition from 
CRD (P, aia2, b) to be a partition from CRD (P, ai, b). 

Note that if we traverse down a tree levelwise, it corresponds to performing a 
sequence of fragmentations. Likewise, if all leaves occur at the same depth (a condi- 
tion that can be enforced by the insertion of dummy nodes), then one can traverse 
up the tree levelwise, and it corresponds to performing a sequence of coagulations. 
If we have a schedule of strictly increasing discounts 01,02,03, ... and a maximum 
tree depth, then we can generate a tree with leaves in {1,2,..., A^} levelwise as given 
in Algorithm [TJ For the partitions generated, it is readily seen that 

Algorithm 1 Sampling a tree with A^ nodes using schedule oi, 02, ..., amaxdepth- 



1. 


root = {l,2,...,iV}; 


2. 


tree = (); nodes = (); 


3. 


list ~ CRD(root, oi, b); 


4. 


for m G list do 


5. 


push{tree,parent{m, root)); 


6. 


push{nodes, (m, 1)); 


7. 


end for 


8. 


while {node, depth) = pop{nodes) do 


9. 


if depth < maxdepth then 


10. 


if \node\ > 1 then 


11. 


list ~ CRD{node, aaepth+i, -adepth); 


12. 


for m G list do 


13. 


push{tree, parent{m, node)); 


14. 


push{nodes, [m, depth + 1)); 


15. 


end for 


16. 


else 


17. 


{Copy through to next depth.} 


18. 


push{tree, {node, depth + 1)); 


19. 


push{nodes, {node, depth + 1)); 


20. 


end if 


21. 


end if 


22. 


end while 


23. 


{Tree is represented as a list oi parent{-, ■) relations.} 


24. 


return tree; 



the partition at depth D (such that 1 < D < maxdepth) as generated will be 
distributed as CRD {root, aD,b); 
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Figure 10: Distribution over the number of children for internal nodes based on 
discount schedule, thus using CRD {node, ad+i, —cld)- 

• the partition applied to a single node (excepting the root node) at level D is 
CRD {node, ao+i, —cld)- 



The expected partition sizes can then be inferred from Lemma 18 For instance, 
if Oi = 0, then the first level below the root will have on average approximately 
61og(l + N/b) nodes. Subsequent node partitioning will occur independently of h 
as the subsequent CRD parameters are set by the discount schedule only. This is 



illustrated in Figure [10| For instance, when splitting at a level where the discount is 
0.56 and the discount for the next level down is 0.70, then look at the third plot with 
discounts labelled (0.56,0.70) which represents applying a CRD (node, 0.70, —0.56). 
By the orange colour, about 50% of nodes of size 10 remain unpartitioned, but the 
remained are split into 2-8 say children. Whereas for the fourth plot with discounts 
labelled (0.84,0.98), 80% of nodes are maximally partitioned into single element 
nodes (leaves) and some small percentage will remain unpartitioned. One can see 
from this that by judicious use of a discount schedule, one can generate trees that 
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split with increasing propensities at lower levels. 

6.2 Operations on the PDD and PDP 

Fragmentation and coagulation also apply to the distributions on infinite partitions 
represented by PDDs (or GEMs) and their counterpart the PDP. 

The basic construction is most easily understood in terms of the basic summation 
form of Formula (12). We start with a simple draw from a PDP, Y.kLiPk^xii')- 
However, we now replace the draws from the base distribution X^ by another draw 
from a PDP with probability vector qk, so we get 

oo oo oo 

J2PkY.^k,j5x;J-) = J2(PkQk,j)5x*J-) ■ (6) 

fc=l j=l k,j=l 

In some circumstances the terms PkQkj can be shown to follow a PDD distribution. 
The general result |Pit06j reworded for simplicity is as follows: 

Theorem 24 (Fragmentation of PDDs) For < ai < 1, < a2 < 1 and b > 

—aia2, if p ^ PDD{aia2,b) and qk ~ PDD{ai,—aia2) for each k = l,...,oo then a 
sort of the resultant {pkQkj '■ j,k = 1, ..., oo} is PDD{ai,b). 

Corollary 25 (Fragmentation of PDPs) Let H{-) be a distribution over some 
measurable space X . For < ai < 1, < 02 < 1 and b > —aia2 introduce a latent 
distribution Q{-) 

g(-) ~ PDP{aia2,b,PDP{ai,-aia2,H{-))) . 

A sample from R{-) is taken by first drawing a sample Q{-) by the nested PDP above, 
and then taking a sample from Q{-). Then it follows that 

R{-) ~ PDP{a,,b,H{-)) . 

Coagulation is the inverse of fragmentation so one follows a similar explanation. 
Start with a simple draw from a PDP, J^kLiPk^Y'i')- However, we now wish to 
cluster some of the values Sy*{-)- To do this we obtain a second draw from another 
PDP, Q{-) = J^'jLi Qj^x*(y-), and replace each Y^ by a sample X*^ ~ Q{^)■ So we get 

00 00 / \ 

Ep^H^-) = E E pA5x;{-). (7) 

fc=l i=l \k:jk=j J 

In some circumstances the terms Yjk:jk=jPk *"^^ ^^ shown to follow a PDD distri- 
bution. The general result |Pit06] again is as follows: 

Theorem 26 (Coagulation of PDDs) For < ai < 1, < 02 < 1 and b > 

—aia2, if p ^ PDD{ai,b) and q ^ PDD{a2,b/ai) and for k = 1, ...,00, jk ^ q, then 
a sort of the resultant \j2k:jk=jPk : j = 1, •••, ooj- zs PDD{aia2,b). 
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Corollary 27 (Coagulation of PDPs) Let H{-) be a distribution over some mea- 
surable space X. For < ai < 1, < 02 < 1 and b > —aia2, sample a distribution 
R{-) using a latent distribution Q{-) as follows: 

Q(-) ~ PDP{a2,b/a^,H{-)) R{-) ~ PDP{a,,b,Q{-)) . 

Marginalising out Q{-), it follows that 

R{-) ~ PDPiaia2,b,H{-)) . 

The intrinsic nature of fragmentation and coagulation is typified by Equations ^ 
and ([7]) respectively and they demonstrate a duality: for < 02 < ai < 1 and 
b >= -02, 

fragmentation 

PDD(ai,6) PDD(a2,6) 

coagulation 

Moreover, fragmentation is an operator for simplifying nested PDPs whereas coag- 
ulation is for simplifying hierarchical PDPs (as used in |WAG^09j ). 

7 Improper Priors 

A distribution or prior is called proper Hit integrates (or sums) to one. The Bayesian 
theory of improper priors allows one to extend the space of reasonable priors. The 
idea is that if the posteriors from the prior are always proper, then perhaps one 
can represent the improper prior as a sequence of proper priors. The limit of this 
sequence may not be proper, but at least its posteriors all are. In this section we 
develop an improper prior that corresponds to the PDD. 

With any L^ distance for d > 1, the infinite-dimensional probability vector p 
of Formula ^ defines a Hilbert spacq^ It is difficult to define a prior probability 
on such a space because not only does one require a measure be defined for the 
infinite vector, it must be normalised, so the total measure is 1. Some theories 
just give priors for finite linear projections of the full Hilbert space, for instance the 
cylindrical measures of Minlos |Min01j . This is sufficient according to Carathodory's 
extension theorem, see Bogachev |Bog07 , to define the prior on the full spaca^ For 



the PDD model, an additional problem is the projections of the prior on finite vector 
subspaces appear to be improper as well. Thus, the best one can do is define a prior 
in terms of a measure for all finite sub-vectors as follows: 



^Only when d > 1 is the subsequent distance guaranteed to be finite for any two members of 
the space. 

^The cylinders form a semi-ring, and we have a countably additive (pre-) measure on the semi- 
ring, this impHes a unique extension on the generated ring, the sigma-algebra is generated by the 
cyhnder sets, and Caratheodory's extension theorem shows that there exists a unique extension of 
the (pre-)measure to the sigma-algebra. 
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Definition 28 (Improper prior for PDDs) Given parameters {a,b), where < 
a < 1 and b > —a, define the improper prior for PDDs (an unnormalised measure) 
as follows. Take any reordering of the infinite- dimensional probability vector p, and 
then for every sub-vector pi,p2, ■■■,Pm of the reordering, use the following measure: 

/ \b+Ma-l il , 

P{Pi,P2,...,Pm,Pm) ■■= [Pm) [[Pm , 



171=1 



where p+ = 1 - J2m=i Pm- 

Note this applies to every sub-vector, so ordering of the probabilities is not needed 
as in Definition|5} The measure p{pi,P2, ■■■iPm-, Pm) iii the definition is an instance of 
an M + 1-dimensional improper Dirichlet with parameters (—a, —a, ..., —a, b-\-M a), 
denoted here informally as 

DirichletM+i(— ct, —a, ..., —a, b -\- M a) . 

Moreover, note that we believe this measure has no corresponding limit form as 
M — )■ oo on the full infinite-dimensional probability vector p. Given an improper 
prior measure, one can infer a posterior measure using an unnormalised version of 
Bayes theorem. If the posterior measure can be normalised, then the posterior is 
now a correct probability. 

It is shown next that the definition is consistent in the sense that the measures for 
different sub-vectors are natural extensions of one another. This property is called 
additivity for proper Dirichlets and is well-known. It is plausible that it should hold 
for the improper measure too, but the standard proofs cannot be transferred since 
the involved integrals no longer exist. Here we check additivity does transfer to 
improper Dirichlets. 



Lemma 29 (Consistency of projections) In the context of Definition 28, if the 
prior measure for pi, ...,pm is projected down to some sub-vector, say pi, ...,pl for 
L < M , then the projected measure is consistent with Definition i 



We must now show the improper prior for PDDs is well defined. This is done using 
the L\ (total variation) distance defined for probability density functions -ff(-) and 
(?(■) as follows 

Li(/7,G) = f\H{p}-G{p}\dp. (8) 

The theorem below says that a sequence of proper priors exist that can approxi- 
mate the improper prior for PDDs arbitrarily closely in the sense that their poste- 
riors given any sequence sample can be made arbitrarily close to the corresponding 
proper posterior of the improper prior. Closeness here is measured by total variation 
distance. 
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Theorem 30 (Justifying improper prior) Using the notation of Definitions 
and^ there exists a set of proper priors Gs for 6 > such that for any e > and 
any sample I^ there exists a 5q such that for all < 6 < 6^ the proper posterior (I) 



given I^j of Lemma 31 is within e by the Li distance of the posterior ofGs given I"^. 



Because the improper prior is well defined, one can justifiably obtain posteriors 
and sampling results from the prior. Now these are identical to those for the PDD 
and PDP, as we detail below, however they were derived from the improper prior, 
not from any of the standard definitions for PDDs or PDPs. 

Lemma 31 (Some properties) Using the improper prior for PDDs with param- 
eters {a,b) and non-atomic base distribution H{-), the following holds: 



erior 



Proper posteriors (I): Using the notation of Definitions^and 28. The post 
distribution given I'^ is 

(^Pi,...,Pm,Pm)\In ~ Dirichlet{ni- a,...,nM - a,b + Ma) , (9) 

where p+ = 1 - Em=iPni- 



Proper posteriors (II): Using the notation of Definition \^ in the case of arbi- 
trary samples Sn, the posterior distribution given S^ is 



M 



Xn+i\pi,...,Pm,Pm,Sn ~ pljH{-)+J2Pm^x^{-) (10) 

m=l 

(pi,...,Pm,Pm) \Sm ~ Dirichlet{ni — a,...,nM — ci,b + Ma) , 

where p+ = 1 - Em=iPm- 

Sampling: Using the notation of Definition [^ if we marginalise out the prob- 
ability vector p, then the posterior distribution in the next sample X^v+i, 
p{Xn+i I Sn), is 

b + Ma ^ Um-a 

Stick-breaking: A stick-breaking like construction holds for the posteriors (I) 
and (II) above. That is, for 1 < m < M 

m—l 
Pm = Kn n (1 - ^0 

where each Vm is independent Beta (n^ni — a,b + ma + J2iLm+i '^ij ■ Since the 

('^m, Z^i^m+i'^O ^''^^ count terms, one can say each Vm has an improper prior 
Beta{—a,b + ma). 
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Size-biased sampling: The prior on a size-biased ordering from the improper 
prior for PDDs with parameters {a,b) follows a GEM{a,b). 



The posterior formulation for PDPs corresponding to Equation (10) is attributed to 
Pitman jPit96j by Ishwaran and James |lJ01i Section 4.4]. The sampling result is the 
standard Chinese Restaurant Process for the PDP from Ishwaran and James |IJ01l 
Section 2.2]. The stick-breaking result here is different to the standard PDP |IJ01l 
Section 2.1], which has stick priors Beta(l — a,b + ma) (that is, it is proper), see 
|PY97j . Here we use improper priors Beta(— a, b + ma), which matches the sampling 
of the CRP as described above. 

Now the PDD(a, b) distribution is defined with sorting, whereas the improper 
prior for PDDs with parameters a, b is not. Therefore they do not correspond directly 
unless some sorting is done. So we can say that sorting the p for an improper prior 
for PDDs yields a PDD(a, 6) distribution. 

8 The Discrete Case 

Now consider the case of discrete base distributions. In this case, H{-) is a proba- 
bility function, not a probability density function, so for each sample Xk from H{-) 
its probability is finite and thus identical draws can be repeated. This makes the 
evidence calculation of Lemma [14] invalid whenever the PDP is used in discrete or 
hierarchical contexts. Here we present the techniques used to get around this. 

If we have not been given the index sequences for a sample from an impulse 
mixture model, we can only guess what the indices might be. In this case, the detail 
of the data partition is partially hidden. So for instance, consider the sample of 
words: 

"from" , "apple" , "to" , "from" , "from" , "cat" , "to" , . . . 

This can have a size-biased ordered index sequence of I^ = 1,2,3, 1, 1,4,3. How- 
ever, since H{-) is discrete, it could be that the three instances of "from" come 
from different indices in Formula (pi). Some other size-biased orderings of indices 
compatible with this sequence of words are as follows: 

1,2,3,1,1,4,3,..., 1,2,3,1,4,5,3,..., 

1,2,3,1,1,4,5,..., 1,2,3,4,5,6,3,..., 

We are unable to say which is correct, however, they have a single coarsest ver- 
sion. Thus we introduce additional latent variables to account for the uncertainty, 
introduced next. 

8.1 Multiplicity 

The definition below, multiplicity, measures the cardinality of the (unknown) set of 
indices contributing to one observation X that occurs multiple times in the data. In 
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Word map 


Index sequence I^ 


Multiplicities t 


"from", "apple", "to", "cat" 
"from" , "apple" , "to" , "from" , "cat" 
"from" , "apple" , "to" , "cat" , "to" 
"from" , "apple" , "to" , "from" , "from" , "cat" 


1, 2, 3, 1, 1, 4, 3 
1, 2, 3, 1, 4, 5, 3 
1,2,3,1,1,4,5 
1, 2, 3, 4, 5, 6, 3 


1,1,1,1 
2,1,1,1 
1,2,1,1 
3,1,1,1 



Table 2: Multiplicities from different index sequences of the sequence Sj 
{ "from" , "apple" , "to" , "from" , "from" , "cat" , "to" }. 



the Chinese restaurant analogy, the multiplicity for the data value X is the number 
of active tables with the particular menu item X. 

Definition 32 (Multiplicity) Consider Definition [^ but now assume that the 
base distribution H{-) is discrete. For a given sample S^, let I^ be a size- 
biased ordered index sequence matching Sn, and assume they are represented as 
Sn = (Xi, ...,Xn) and In = {ki, ...jk^). The multiplicity of the value X G Sn is 
defined as the size of the set {/c„ : n = 1, ..., A^, X„ = X }. 

Multiplicities are statistics from the latent indices In and are thus themselves latent. 
Continuing the words example at the start of Section [8} some of the potential size- 
biased ordered index sequences are illustrated in Table [2j The first column gives 
the map from index to word, the second column gives the resultant index sequence, 
and the third column lists the multiplicities for the words "from" , "apple" , "to" , "cat" 
respectively {i.e., in size-biased order). Note, for instance, in the last row, the word 
"from" appears 3 times in the map and thus has multiplicity 3. 

For the discrete base distribution, we must consider the situation where the 
multiplicities can be greater than one, so a more general probability of evidence 



result is needed, for instance for Lemma 14, since PDP(a, 6, H) returns values from 



X, but no indices. The following corollary of Lemmas 14 and 16 is a special case of 



|Teh06at Equation (31)], there proven directly for the hierarchical PDP. 

Corollary 33 (Evidence for discrete case) Consider the probability of evidence 
for a finite sample Xi,X2, ...,Xiv from PDP{a,b,II) with discrete base distribution 
if(-). Use Definitions^ and let tm be the latent multiplicity of X^ in the sample, 
and let their total Y.m=i ^m = T. Note they must satisfy the constraints < t^ < '^m 
and tm = if and only if Um = 0. Then the joint probability of the sample and the 
multiplicities is: 

P(X1,X2, 



, Xiv,ti, ...,tM I a,b,H{-)) 



{b\a) 



(b) 



N m=l 



M 

n {Hix: 



orim 

tm ■,0' 



where S^j^^ is defined in pi). 

Notice if one is Gibbs sampling with the latent multiplicites tm, then one needs to 
sample tm for all values from 1 up to n^. This can be a problem if n^ = 1000000. 
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8.2 Table indicators 



A second representation developed in |CDB11] stores a table indicator for each data 



item in a way that makes it exchangeable. For this, let the table indicator r„ be 
zero if the data Xn does not contribute a new table, and one if it does contribute a 
new table. The multiplicity tk for the data value X^ is then computed as the sum 
for those with the same data value, so tk = Z]^=i^nlx„=x*- Since we are invariant 



k 



as to which of the data starts a table, and there are C"^ choices, the above posterior 
is modified to yield: 

Corollary 34 (Evidence for discrete case with table indicators) Following 



the situation of Corollary 33 



piX,,X,,...,Xr,,r,,...,r^\a,b,Hi-)) = ^ ft f^K^^r^a-A j , (H) 

where the t^ are derived from the indicators r„. 

Note the table indicators do not appear explicitly in this other than through tm and 
thus we can "forget" the table indicators and randomly resample them as needed 
at any stage of Gibbs sampling since by symmetry their probability of being 1 is 

This representation has two major advantages: one only needs to incrementally 
change the tm, not explore all possible values between 1 and n^, and it only requires 
the use of ratios of Stirling numbers (for instance S^^i al St'^^a) which can be easier 
to compute. 



8.3 Moments 

Useful quantities to understand the application of the PDF to a discrete base dis- 
tribution, especially for the hierarchical case, are its moments. We give them here 
so we can properly interpret the discrete case. 

Lemma 35 (Moments for the discrete case) Assume the discrete base distri- 
bution H{-) is over the integers IN , with probability vector 9, so there is probability 
Ok for the value k. Let p ~ PDP{a,b,H). Then the mean, variance, covariance 
and third order moments of p according to this prior are given by 

]E[p\ = 6. 
Var[pk] = l^0kil-0k) 

Cov[pk^,Pk2\ = -7--r^0kidk2 
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IE [{Pki - Ok,){pk2 - dk2){Ph, -dk^)] 

2 (b+ij(b+2) ^fci^fc2^fc3 when ki,k2, h disjoint 

(b+i)(&+2) (^^fci ~ l)^fci^fc2 whenki = k2^h 

(l+i)(fe+2J ^fci(l - ^fci)(l - 26*^ J whenki = k2 = k3 

— * 

Now consider the case where -ff(-) has domain l,...,i^, and probabihty vector 9. 
Denote this by discrete (^). Consider a K dimensional Dirichlet distribution with 
parameters given by a6. This has corresponding moments 

E[p1 = 9. 

Var[pfc] = -^9k{l-9k) 

Cov[pk„Pk2] = r-r0ki9k2 

a + i 

E [{Pk, - 9kJ{Pk2 - dk2){Pk3 -dki)] 

(a+iKa+2) ^fci^fc2^fc3 when /ci, /cg, /cs disjoint 

(a+iKa+2) (2^fei - l)^fci^fc2 when ki = k2 ^ k^ 

(a+iKa+2) ^fci(^ - ^fci)(l - 2^fci) when ki = k2 = k^ 

Thus we can conclude following: When < a ^ 1, then we have that 
PDP(a, 6, Categorical(^)) is approximated by a Dirichlet (fzf^ (and it is already 
known equality holds when a = 0). The two distributions differ by a factor of 0{a^) 
in all the moments of order one to three. Thus the PDP applied to finite discrete 
distributions is approximated by a proper Dirichlet. 

8.4 Computing Stirling numbers 

To work with the discrete case, one needs to sample the multiplicities tm- These can 
be sampled using Gibbs sampling and precomputed tables of the Stirling numbers 
^?a- When used in sampling, the Stirling numbers S^^ need to be tabulated or 
cached for the required discount parameter a. Because they rapidly become very 
large, they need to be stored and computed in log format. The recursion must be 
used rather than the exact formulation, and becomes 

logS"+i = logs- + log (exp (logs- - logs-) + in-ta)) . (12) 



The log() and exp() functions can make the evaluation slow if implemented naively. 
Moreover, memory requirements for the full table of S-^ with a fixed discount a 
is n{n + l)/2 fioats for t < n. When keeping the discount a fixed, we can reduce 
memory with two tricks: (1) placing a maximum value on t say 100 or 1000 to limit 
the cache and (2) striping the cache for higher values of t, so only every L-th value 
is stored, entries of S-^ for n = N, N + L,N + 2L, ... and t < n. Computational time 
in this case becomes 0{2^/L). These two techniques save considerable memory at 
a small factor in computational time or sampling accuracy. 
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Table 3: Comparative values for ratio versus log calculations for a = 0.5 



{n,t) 


double V,^, 


float V,^, 


float exp (logs,';, - \ogS^_,^,) 


(10000,10) 
(10000,100) 
(10000,1000) 


0.222133 
0.0201025 
0.00189684 


0.222124 

0.0201025 

0.00189685 


0.180696 
0.0262359 
0.00181349 



8.5 Ratios of Stirling numbers 

When sampling with the table indicator representation, one repeatedly needs ratios 
of Stirling numbers. Denote the Stirling number ratios 



U, 



nn+l 



t,a 



V, 



on 
>~'t,a 



an ■ t,a on ' \ '^) 

'-'t,a ^t-l,a 

These have the advantage that they do not need to be stored in log space. The first 
ratio, ?7"„, is readily computed from the second, V^" , so is not stored: 

[/", = n — a for n > 1 



U, 



t,a 



yn 



+ (n — to) 



for n > t > 1 



The second formula follows directly from the linear recursion for S*",. The second 
ratio, V^'^, has the following recursion 

1 



V" 



K 



ra+l 
t,a 



Tjn-l 

1 
Tjn 



l + {n-ta)V,l 



for n > 2 



for n > t > 2 



(14) 



Thus a simple recursion yields tables for VJ", from which ?7", can be computed with- 
out any resort to the log storage of the Stirling numbers or the use of transcendental 
functions. 

Moreover, these ratios yield more accurate computation because the direct log 
computation of the Stirling numbers leads to some loss of precision. Values in Ta- 
ble |3] computed with discount parameter a = 0.5 presents the results using floating 
point (32 bit) calculation versus the double precision results (64 bit) as a proxy 
for evaluating round-off error. One can see that the linear recursion in log space. 



Equation (12), yields considerably less accurate results than the ratio recursion of 



Equation (14) when computation is done with floats rather than doubles. Compu- 



tation using the ratios can also be upto 5 times faster. 



9 Discussion 

For the non-atomic case of the two parameter Poisson-Dirichlet distribution, consis- 
tency, convergence and posterior results have been presented, mostly drawn from the 
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literature, though some proofs are given in the Appendix. We have augmented these 
results with a number of plots to illustrate the nature of the underlying distributions. 
Most significantly, in practice we recommend fitting one of the two parameters a or 
h of the PDP or PDD when performing inference with them. 

Chinese Restaurant distribution: The distribution on partitions induced by 
the two parameter Poisson-Dirichlet distribution, called here the Chinese Restau- 
rant distribution, has also been presented, together with a summary of its use in a 
scheme for generating trees. The fragmentation and coagulation duality for the Chi- 
nese Restaurant distribution and the two parameter Poisson-Dirichlet distribution 
followed: coagulation allows simplification of some hierarchical Poisson-Dirichlet dis- 
tributions, and fragmentation allows simplification of some nested Poisson-Dirichlet 
distributions. 



Improper Priors: The infinite probability vector underlying the Poisson- 
Dirichlet distribution was shown to be in the form of an improper Dirichlet on 
any finite sub-vector. As soon as data is presented, the posterior becomes proper 
(or normalised). Alternatively, if a size-biased ordering of the probabilities in the 
infinite vector is made, the knowledge implicit in making a size-biased order turns 
the improper Dirichlet into a GEM distribution, the standard "stick-breaking" dis- 
tribution most commonly used to define a Poisson-Dirichlet distribution. 

Discrete distributions: For the discrete case, not well covered in the Probability 
and Statistics literature, posterior results have also been presented. Moreover, it 
has been shown that the two parameter Poisson-Dirichlet Process with discount 
a > and concentration 6 on a discrete base distribution behaves rather like a 
Dirichlet distribution with concentration ^^. The Dirichlet Process on a discrete 

l+a 

base distribution is identical to a Dirichlet distribution. 



Semi-conjugacy: While the posteriors presented for the discrete case of the PDP 
introduced latent values, multiplicities or indicators, they are conjugate to the multi- 
nomial family. This means the Poisson-Dirichlet Process and the Dirichlet Process 
can be used hierarchically and remarkably a form of semi-conjugacy applies {i.e., 
conjugacy at the expense of introducing latent variables). 

Second order Stirling numbers: Computation of the Stirling numbers needed 
to perform Gibbs sampling was also presented. The closed form for computing Stir- 
ling numbers looks like a difference approximation to an M-th order derivative, thus 
it is intrinsically unstable to compute for higher values of M . The standard linear 
recursion therefore seems necessary, although double precision is needed for larger 
values. Ratios of Stirling numbers, however, can be more accurately computed. 
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A Proofs 

A.l Proofs for Section [5] 
A. 1.1 Proof of Lemma I 



Build on the result from Lemma 14 using Definition [8] The formula of Lemma 14 
also applies to I^, the indices In with size-biased ordering applied, so p(/jv), but 
remove the terms in H(X*). Using the notation of Definition ^ and this lemma, we 
get the form 

Now one can marginalise out the entires in I^ but keeping the constraint that there 
are M distinct fc's in there, which will affect the last product of M terms only. 

The indexes 1, ..., M occur in the sequence I^ of size A^. Ignoring the ordering 
constraints of size-biased ordering, there are A^ choose rii, ...,nM, Cni,...,nM ways the 
indexes can occur in I^. Now adjust this for the ordering constraints. For every 
sequence starting with 1 there exists some starting with 2, ..., M. By symmetry, 
rii/N of the sequences start with 1. Now how many of these have the second integer 
appearing in sequence being 2? Again by symmetry, 77,2/ (A — rii) of the sequences 
starting with 1 have 2 as the next integer in sequence. Likewise, of those sequences 
with 1, 2 being the first two occurring integers respectively, n^/{N — ni — ^2) have 3 
occurring next. Thus, the number of sequences by size-biased ordering with counts 
ni, ...,nAf are 



fiN TT 

^n\,...,nM 11 



^^ n 



m—1 

m= 



iN-j:T=Tn, 



Inspection shows this evaluates to an integer since each term A — Xli^i^ ^i divides 
into A!. 



To marginalise out the indexes I^ in (15) then, one does 

M 



rir 






{b\a)M /yi V- yr f ^i'^rn - a) n 



m—l 



{b)N ^,; ^^ ^^„^iAr(nrn + i)r(i-a)A-E.=l 



Ui 



The full summation formula for S*^ ^ follows. 



A. 1.2 Generalized Stirling Numbers 



We need the following expressions for generalized Stirling numbers. All but the 
explicit expression (iii) are due to jHSSSj . 
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Theorem 36 (Expressions for Generalized Stirling Numbers) The follow- 
ing expressions all define the same generalized Stirling numbers S{n,k;a,(],r), 
where the parameters a,P,r & IR have been suppressed when constant. 



(o) Implicit: {t\ — a)n = ^ S{n^ k){t — r\ — /3) 



k 



k=0 



Both sides are polynomials in t of degree n. {z\a)n '■= z{z + a) . . .{z + {n — l)a) . 
(i) Linear recursion: S{n-\-l,k) = S{n,k — 1) + [kf] — na + r)S{n,k) 

Boundary cond.: S{n, k) = for k > n, S{n, 0) = (r| — a)n 
(a) Mult, recursion: (j^]S{N,K,a,j3,R)= (any k,r) 

N 

= Y.(l)S{n,k-a,(3,r)S{N-n,K-k-a,(3,R-r) 



n=0 

1 '' 

(Hi) Explicit expression: Sin, k) = T^ ( • ) (— )'^~"'(/3j + r\ — a)„, (T? 7^ ) 

k\ 13" p^ ^J^ 

(iv) Generative fct. : £ S{n, k)-^ = ^^ ^ '^'^^' ° X^^^^^^] ^"'^ ^ °^ 

Proof. The generalized Stirling numbers are defined in |HS88j by (o). Hsu and 
Liu derive expressions (i),(ii), and (iv) from (o): It is easy to verify that recursion (i) 
satisfies definition (o). Using (i), one can see that the generating function gk{t) := 
X)^o '5'(^, k)t"-/n\ satisfies the differential equation system 

il+at)-gkit) = gk-iit) + {kp+r)gkit) with gkiO) = and goit) = (l+aty/", 

which has a unique solution. Substituting (iv) into this dgl shows that (iv) is a 
solution. If we take a product of the generating functions of S{n, k; a, (3, r) and 
S{N — n,K — k;a,f3,R — r), use (iv) , and identify the coefficients of t"' in both sides, 
we arrive at the multiplicative recursion (ii). 

Interestingly, Hsu and Liu do not derive the explicit expression (iii), although it 
easily follows by Taylor expanding the r.h.s. of (iv) and by identifying the coefficients 
of t" as follows: The binomial identity gives 

k 

((1 + at)'^/" - 1)'= = ^g)(-)'=-^'(l + at)^-'> 



j=0 

r(7+i) 





1 -; - Z^n=0 \nj- ' "^^^^^ \nj " n\T{-y-n+l)^ 


^S{n,k)- . 

n=0 "'■ 


- ,,,3.1: (-f-E "("*)" 




00 -| fc j.n 
n=0'^-/^ j=0 "'■ 



we get 
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A.1.3 Proof of Theorem flTl 

The proof is based on (a) a recursion for p{M\N)^ (b) the expressions for the gen- 
erahzed Stirhng numbers in Appendix] A. 1.2, and of course (c) the definition rtsl) of 



S^^. In order to distinguish between different M as the sample size N increases, 
use Mtv to denote the value at sample size N . 



(i) We exploit recursion 

h + {m — l)a N — ma 

P{Mn+1 = m\MN) = ^MN=m~l J—^ + lM^=m ^ , 

which easily follows from the predictive sampling distribution ([s]). Multiplying each 
side by p{M]\[), and summing over M^ this becomes 

/,^ N /,^ _,6+(m — l)a ,,^ .N — ma 

p{Mn+i = m) = p{Mn = m - 1) ^ — -——+ p{Mn = m) 



16 



into 



b + N ' -^ ^' ^ 5 + iv 

Inserting the explicit expression p{M]\f = m) = S^^{h\a)m/{h)iq of Lemma 
this recursion and canceling all common factors we get 

^m,a = ^m~l,a + (A — ma)S^^^. 

The boundary conditions S^ ^ = for m > A and Sq^ = 6n,o follow from the explicit 
expression in Definition (IS]) or simply by refiecting on the meaning oi p{Miy = m). 

(ii) and (iii) Consider the generalized Stirling numbers S{n,k;a, P,r) for the 
special choice of parameters (a,/3,r) = (— 1,— a,0). For this choice, recursion (i) 



of Theorem 36 reduces to recursion (i) of Theorem 17, including the boundary 
conditions. Hence S^ ^ = S{N, M; —1, —a, 0). 



It is easy to see that also (ii) and (iii) of Theorem 36 reduce to the first expression 



of (ii) and (iii) of Theorem 17 for (a,/3,r) = (— 1,— a, 0), which shows that the 
expressions are equivalent. 

The special case for (iii) when a = holds by noting that (iii) for a > is in fact 
an M-point interpolation to an M — 1-th partial derivative. As a — ;■ 0, this becomes 
the partial derivative. 

The last expression in (ii) follows from the definition of S"^ ^ in (pi) by splitting 
the sum into I]ni=i ^^^ J2n2+...+nM=N-ni ^"^^ the product into m = 1 and m > 1, 
and identifying the terms with ( ~M, 5*")^ and Sf^Z^]a- 

Note that the first expression in (ii) does not reduce to the second expression for 
m = 1. Nevertheless, the (very different!) derivations of the two expressions show 
that they must be equal. 
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(iv) Using r{N + x)/r{N + y) ^ N^'^ for large A^, we see that the r/i-contribution 
in (iii) is asymptotically proportional to 

^\, , T(N-am) -amT(N)T(N - am) n^oo -amT(N) 1 

I I (/l — CbTfl) = — = — 

/^}q^ ' r(-am) r(l- am) r(A^) V{1 - am) N""^ 

Due to the factor m, the m = term does not contribute. So the dominant contri- 
bution is from m = 1, followed by m = 2, etc. The m, = 1 term yields 

.N ., 1 .. «r(iv) 1 _ 1 1 r(iv) 



SZ. ^ -^T^^M 



^'" M\a^ T{l-a)N^ r(l-a) r(M) a^^-i A^" 

The relative accuracy is O^M/N""), i.e. the approximation is good for M <^ A^*^. The 
smaller a, the larger N needs to be to reach a reasonable accuracy. Higher m-terms 
may be added, but the alternating sign indicates cancelations and hence potential 
numerical problems. 

(v) follows from S^.jq = S{n,k; —1,0,0) = |5'(n, A;; 1, 0, 0)| and the fact that 
S{n, k;l,0, 0) are Stirling numbers of the first kind from |HS88j . 

A. 1.4 Proof of Lemma 

We need to differentiate the different M that results from the partition sample Jj^ 
as A^ increases. Subscript M as Mjy so we can differentiate it as A^ changes. When 
M^v is known, the following series relation holds: 

^ r , , n b + Mj^a _ b a + b + N ^^ 

Taking expected values across Mn yields the recursive form 

The equation for IE [Mn] given in the lemma is proven from this by induction, with 
the value 1 when A^ = 1. Note the derivation of the solution to the above recursive 
formula was made by unfolding the recursion into a summation, and then simplifying 
the summation using hypergeometric functions. 

The approximation for IE [M^] given in the lemma is derived for N,b ^ a as 
follows: 



(a + b) 



N 



(b) 



N 



exp (log T{a + b + N)- log T{b + N) - (log T{a + b) - log r(6))) 
exp{a{Mb + N)-Mb))) 

NY ( aN \ 
1 + -J- exp 



bj '\2b{b + N)^ 
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The first approximation step makes a first order Taylor expansion since a is small, 
< a < 1, and the second approximation step uses an approximation for ipoip) with 
error 0(1/6^). 

For the expected variance, a similar strategy is used but the steps are more 
complicated. The following series relation holds: 



IE 



Mn 



M'^+1 



{Mn + 1)^ + ^ — T^M^ 



N + b 
b 26 + a 



N + b 



N 



N + b N + b 



Mn + 



2« + & + iV^2 



N + b 



N 5 



where IE [M^] = 1 when A^ = 1. Taking expected values over M^- yields the recursive 
form 



IE 



M'n+i 



b 2b + a^ ,_ , 2a + b + N^ 



Ml 



Evaluation of this recursive formula can be made as before, and the result is the 
formula 



IE 



M^ 



b{a + b) (2a + b)N b{2b + a) {a + b)^ b"^ 



(b) 



N 



(b) 



N 



The result then comes from evaluating IE [M^] — (IE [Mtv]) and simplifying terms. 
The approximation proceeds as before. 

To handle the case where a = the same recursive formula for IE [M^] and 
IE [M^] can be used, but are evaluated differently since a = 0. The closed form 
formula for IE [Mn] follows clearly by induction on A^. The closed form formula for 
IE [M^], readily proven by induction, is 

b{Mb + N)- Mb)) + b^ (Mb + N)- Mb))^ + b\Mb + N) - Mb)) • 

Subtracting off (IE [Mtv]) yields the result for Var [Mat]. The approximations for 
both IE [Mn] and Var [Mn] follow by taking the first order terms of ^o(') aiid ipii-), 
the log and the inverse respectively. 



A. 1.5 Proof of Lemma 



The value M is equal to the number of indices that have a non-zero count in the 
sample of size A^. Given probability vector q, the probability that index k has a 
non-zero count after A^ samples is 1 — (1 — Qk)'^ ■ Summing these over all k gives an 
upper bound. 

To generate bounds on 1 — (1 — g^)^, note 1 — (1 
closer the larger g^. Second, by Taylor expansion 



Qk)'^ < 1, and the bound is 



1 - (1 - qk) 



N 



N{N-1) ,2 
Nqk - -^ -q'k 
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for some < g[. < g^. So 1 — (1 — g^)^ < Nq^, and the bound is closer the smaller 
gfc, especially when Nqk <^ 1. Put these two bounds together and we get for any- 
positive integer m 

oo oo 

^ 1 - (1 - g^)^ < m + N J2 ^k- 

k=l fe=m+l 

For the geometric series, qk = r'^~^(l — r), the sum in the bound evaluates to r™, 
so we seek to minimise m + Nr"^. This bound can be modified if we let m G iR"*" to 

min m + Nr"^ < min 1 + m + Nr"^~^ 



Differentiating yields a minima at r"^~^ = ^^^ ^ . The result follows by substitu- 
tion. 

For 
integral 



TV log 1/' 

tion. 

For the Dirichlet series, qk = ^h^, the sum in the bound can be bounded by an 



oo „< 

E ?^ < / 



+1/2 (kycis) 



k=m+l 

-1 



m+1/2 



{ky-^C{s){s - 1) 
1 

{m + l/2Y-^as){s-l) 

As before, modifying the bounds yields the formula to minimise 

N 
m + 1 + 



{m-l/2y-X{s)is-l) ■ 
Differentiation gives a minimum at A^ = (m — l/2)^({s) and so the bound follows. 

A. 2 Proofs for Section [6] 

A. 2.1 Proof of Theorem 1221 

One sees the samples generated by the fragmentation process. Given a partition 
with partition count L with occurrences ni,...,nL totalling A^, one needs to assign 
every entry / to a latent cluster ki. This results in a coarser partition with count K 
bins, and with total occurrence m^ = J2i:ki=k^i (where m^ = \Pm\ for the latent bin 
Pm) from tfc = Y.i:ki=k 1 original bins, totalhng Y.k=i ^k = L. The full probability for 
the fragmentation components is (note 02 could also be zero here, but these terms 
correctly cancel, so the parallel equations for a2 = are not given): 

^7;^n(i-«i«2U-i ■ n \ , A n (i-^On,-! 

If^M fc=l fc=l l-ai«2jmfe j:fc^=fc 
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(l-ai)„,_i(6|aia2)xll 



[o)n i^i j^^i —aia2 

I L K 

-a^Ylil -ai)„,_i • (6/01102)^- n(l -^2)tfc-i 



^">N 1=1 k=l 

We need to marginalise over all possible coarser partitions or assignments to ki. Now 
the second term is in the form of a CRD ({1, ...,/}, 02, b/ai) so marginalising out the 
tk yields (fe/ai)^, which multiplied by af gives (6|ai)L. The result is that the above 
probability becomes 

-JJ-^ ll(l-«l)n,-l ■ 

[b)N ,=1 

This is the CRD ({1, ..., A^}, ai, 6). 

A. 2. 2 Proof of Theorem 1231 

Consider a sample partition with a bin j with Uj entries, however we do not know 
which sub-partitions (from the unknown index k) the entry comes from, nor how 
many there were. So now each bin j would be sub-partitioned into Lj bins with 
occurrence counts m^y each where Y.i=ii^j,i = f^j- In total, there are L = Y.j=iLj 
bins. With this assignment done, the probability for the coagulation becomes: 



J,( \ / J-/-^ J 



{h\aia2)j 



(h\ ( 7^rTjn(-«i«2|ai)L,n(l-«i)m...-i • 

Note the last formula is undefined for 02 = 0, however, the zero terms cancel 
and a parallel proof for the 02 = case is readily seen to hold. To construct 
the marginalised probability, we have to marginalise over all possible sub-partitions 
[Lj bins with occurrence counts ttij^i). The term inside the n/=i is the form of a 
CRD ({1, ..., Lj}, ai, —0102) so marginalising out the partitions (represented by Lj 
andm^-;) yields (— aia2)n = (1 — 0'ia2)n -i(— ^102) (again, noting a similar argument 
applies for the 02 = case). The probability simplifies then to 

(%l«2)j /r.^ ^ 

— ll(l-aia2)„^.-i . 

This is then CRD ({1, ..., A^}, 0102, h) as required. 

A.2.3 Proof of Theorem [2l 

To prove the result, it is sufficient to show for all samples, the two have equivalent 
marginalised posteriors. Consider a sample partition, for a given bin each entry 
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would be drawn with a probability Pk(lj,k, however we do not know the j or k 
associated. We have to assign the k, whereas the j term is simply a sub-partition 
so we do not need to know it. Given a sample partition like this of L bins with sizes 
ni, ..., n^ totalling A^, one needs to assign every entry / to a cluster ki. This results in 
a coarser partition with K bins, with total count rrik = J2i -. ki=k ^i from tk = J2i: ki=k 1 
original bins, totalling Yl,k=i ^k = L. But with the clusters ki assigned, the posterior 
with the marginalised terms for p and g^'s can be given. This posterior corresponds 



to the situation in Theorem 22 , so apply that result and this converts the posterior 
into a probability for CRD ({1, ...,N},ai,b). This is the required posterior for the 
sample from FDD (ai, b). 

A. 2. 4 Proof of Theorem [26] 

As before, to prove the result, it is sufficient to show for all samples, the two have 
equivalent marginalised posteriors. Consider a sample partition with a bin j with Uj 
entries, each entry would be drawn with a probability J2k:jk=j Pk, however we do not 
know which sub-partitions (from the unknown index k) the entry comes from, nor 
how many there were. So now each bin j would be sub-partitioned into Lj bins with 
occurrence counts m^y each where Y.i=if^j,i = f^j- In total, there are L = Y.j=iLj 
bins. With this assignment done, the marginalised posterior can be written down 
from the term for p and the term for q, however this corresponds to the situation 
from Theorem [23j Applying that result yields a marginalised posterior in the form 
of a CRD ({1, ..., A},aia2, 6). This is the required posterior for the sample from 
PDD(aia2,6). 

A. 3 Proofs for Section [T] 

A. 3.1 Proof of Lemma 1291 

Consider the prior measure for pi, ...,pm,Pm- Do a change of variables to 
Pi,...,PM-i,qM,PM-i where qm = Pm/Pm-i and pXj_^ = pu + Pm- The Hes- 
sian of this change is l/p\.j_i, and the domain goes from the constraint set 

{Pl > 0,...,PA/ > 0,pIj > 0} to {pi > 0,...,PM-l > 0,Pm-1 > 0,0 < QM < 1}. 

The prior measure can thus be converted to 

m=l 

under the new constraint set. Note the prior measure on sub-vector pi, ...,pm-i, as 



given in Definition 28, appears in the second half of this measure. The initial part 
involves only qm, but its constraints are simply < Qm < 1 which are independent 
of the remaining variables. Thus one is left with a measure on pi, ...,pj\/_i. The 



measure on the sub- vector is now consistent with Definition 28 , We can repeat this 



process recursively to verify consistency for any other sub-vector. 
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A. 3. 2 Proof of Theorem I 



Consider Definitions 28 and pi Define Gs in terms of its projection on the finite 



, b+M a—1 



sub-spaces {pi-,P2, ■■■,Pm} for aU M. Let 

p{Pi,P2,-;Pm,Pm) oc [pjjj 11 Pm" , (16) 

m=l 

where pj^ = 1— I]m=i Pm ^^^ ^^e domain is constrained to be Pm > {^—Y.^^ Pi)^ for 



m = 1, ..., M, and p|/ > 0. Note that by Definition 28, 6 > —a, and thus h + Ma > 
0. Exploiting pm > S for m = 1,...,M, we show below that the proportionality 
constant, i.e. the integral over the constrained simplex, is finite. To show Gs is 
proper, we need to show that the finite priors for each M are proper and that 
consistency holds between these priors for different M. 

The normalization is done as follows. Use the same change of variables as in the 



proof of Lemma 29 however now the domain is different. The constraint set for the 
initial variables is 

Cp,M = IP1>S, ...,Pm > U - ^ Pi) S, ...,Pm >il-^PijS,plj>0\ . 

By the change of variables this gets mapped to 

Cg,M = lpi>S, ...,PM-1 > I 1 - J2 Pi] ^^PM-1 >0,S<qM<l} ■ 

For the purposes of integration, denote the initial and changed variable sets as p 
and g* respectively. Thus the integration works as follows: 



^a,b,M,5 



r / , \b+Ma~l SL -, 

L {pm) n p'n^-'dp 

-'Cp.M rn=l 

i (pt,-0'"""'°" nVr-' (1 - «m)'^"°- fir'd, 

•"-^q,M m=l 

/■I 
ry I fi \b+Ma—l — a— Ij 

Za,bM-i,& / (1-9) Q dg 



M 



ivi „i 

= - = n / (i-g)'-'"'"-'g-'^-Mg. 

m=l 

Note this is bounded above by bounding the g~"~^ terms from inside the integral 
with 5~"~^, and extending the integrals to the range [0,1]. This yields the upper 

bound 5-^^("+l) n*^ , ^{b+ma) 
UOUnu llm=l r(b+ma+l)- 

Now we prove consistency. We need to show that the projection from the sub- 
set m = 1,...,M down to some smaller subset m = 1,...,M' < M is consistent. 
The change of variables above handled the case where p{pi,P2, ■■■,PM,pti) ^^s pro- 
jected down to p{pi,P2, ■■■,Pm-i,Pm-i)- Clearly, the projected prior is equivalent 
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to the direct definition above (see Lemma 29 for details). Tlius by induction, one 
can project tlie prior from the subset m = 1, ...,M down to a any smaller subset 
m = 1, ...,M' < M, and get the same prior. By this condition, and Kolmogorov's 
Consistency Theorem, it follows that the prior Gs exists and is proper for the full 
Hilbert space of p. 

Now consider the posteriors for a given sample I^. The posterior for In using 



the improper prior on PDDs is given in Lemma 31 To deal with the proper prior 
Gs, the notion of partition size is needed, as given in Definition [8j Let M^ be the 
partition size for a In, then p{pi,p2, ...,Pm,pXi\G5,In) is proportional to 

/ , \b+Ma—l -,— r 1 -TT- 1 

{p^m) n p;:~' li pt-^-' ^ 

m=Afjv+l m=l 

where the constraints Cp^M hold as before. This is the same form as the posterior 



Dirichlet distribution (I) given in Lemma 31 where the probabilities are further 



constrained by Cp^M- The normalizing constant can be worked out as above to be 

Mn 

Za,b,MN,s = n Bi-s{b + ma, rim - a) 



m=l 



where Bx{u,v) = /Q^t"~^(l — ty~^dt is the incomplete Beta function defined for 
u,v > 0. In our case, rim > for all 1 < m < Mn and a + 6 > 0, so the 
Beta function and incomplete Beta function are well defined. Note the normalizing 
constant for the posterior Dirichlet distribution (I) given in Lemma 31 is Za^b,MNfi- 



Now consider the Li distance between the two posteriors, 
p{pi,P2, ...,PM,PM\Crs, In) and the posterior Dirichlet distribution (I) given in 



Lemma 31 Note these differ only in domain. Denote them by Ps and Pq 



respectively. Using Ps > Pq on Gs, and Ps = on Go\Gs, and Jq^ Psdp = 1, we get 



1 

2 


di{Ps,Po) : = 

^ \SgJ'- 


sup \Ps 
A 

Podp 


A]-Po[A] 






^2l^- 


Po dp+ \j \Ps- 
2 Jgo\Gs 


^ol 




- IL/^''^ 


2Jgs 


Podp+ I f 

2 Jgo\g 


s 




= 1-1 Podp = 1 

JGs 


Za,b,MM,S f 
Za,b,M,^,0 JGs 


Psdi 




_ -. Za^b,MM,S Q 


for (5^0 





^a,b,MM,0 

This implies convergence in distribution. 



48 The Poisson-Dirichlet Process 



A. 3. 3 Proof of Lemma 

Proof sketch. Note for the Proper Posteriors II claim, since H{-) is non-atomic, 
each distinct data X^ has a corresponding distinct index k^, thus for the purposes 
of analysis, assume the indices are given and w.l.o.g. they follow size-biased ordering, 
so k^ = m. Thus to prove the Proper Posteriors I and II claim about posteriors for 



p, multiply the prior measure for (pi, ...,pm) of Definition 28 by the likelihood, which 
is in terms of the same sub- vector, and the posterior measure clearly is proportional 
to the corresponding posterior Dirichlet in this lemma. The remaining part of the 
Proper Posteriors II claim follows from the model family. 

To prove the Sampling claim, note that this just takes the expected value of 
the posterior in Proper Posteriors II. To prove the Stick Breaking claim, note this 
follows directly from the posterior by standard properties of the Dirichlet. 

The size-biased sampling claim is developed sequentially. First note pi has the 
improper prior Beta(— a, 6 + a). The value ^2/(1 — Pi) is apriori independent of 
Pi and has a the improper prior Beta(— a,6 -|- 2a). The value Ps/il — Pi — P2) is 
apriori independent of pi and p2 and has a the improper prior Beta(— a, 6 + 3a), etc. 
However, because pi is size-biased we know it is the first in the sample, so we add 
one to get a Beta(l — a, 6 + a). Likewise, we also know p2 appears first in the sample 
(after pi), since it is sized-biased, so again we add one getting Beta(l — a, 6 + 2a). 
Repeating this yields the standard stick-breaking definition of the GEM distribution. 

A. 4 Proofs for Section [8] 



A.4.1 Proof of Corollary 33 



Proof sketch. In the general case where each draw from H{-) is not nec- 
essarily almost surely distinct, the formula of Lemma [14] also applies to 
p(Xi, X2, ..., Xat, fci, ..., /ctv)- Now one can marginalise out the fci, ..., /cat, which will 
affect the last product of M terms only. 

Given the constraints that tm represents the multiplicity of X^ and Uk represents 
the total count of X^, then all values for ki, ...,kN must be included that satisfy the 
constraints. Each n^ will be partitioned into tk different indices, each occurring 
at least once, and totaling n^. Thus the problem of marginalising out the indices 
ki, ...,kj^ to the multiplicities ti, ..., ^m is equivalent to the summation over configu- 
rations by size-biased ordering, done for Lemma [16} and an identical result can be 
applied. 

A. 4. 2 Proof of Lemma 1351 

Let p ~ PDP{a,b,H). Let q ~ PDD(a,6) be the underlying PDD, and let the 
corresponding independent samples from H{-) be Xi G A/". From the definition of a 
PDP, 
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Taking the expected value of this over X, yields J2i Qidk-, and hence 9^ irrespective 
of q. 

Now consider any moment. We present one case, and others can be treated 
similarly. For ki,k2, k^ three indices 



IE 



q,X 



Yl (li^Xi=ki -6ki] \J2 (li^Xi=k2 -Oki] [Yl (li^Xi=k3 - 0. 



ka 



IE 



q,X 



Y Ihlhlh \}Xi^=k^ - 6ki) [Ixi2=k2 - ^ki) [lXi^=k3 - 0k:i 



Now Xjj is independent of Xi^ whenever /i ^ I2. So we have to express the sum 
Y.h,i2,h ^'^^^ different equal and unequal parts so that the expected value over X can 
be applied. This would be 

E ■ = E ■+ Y ■+ Y ■+ Y ■+ Y ■ 

h,h,l3 hthth disjoint h=l2T^h h=hT^l2 l2=hT^h h=h=h 

Any sum which has a term with one index, l\ say, not equal to the others, will 
contain the expression 



IE 



X, 



LX,^=fcl - t7fc 



e. 



Ok, — 



1 "Kl 



0, 



and hence can be discarded. Thus for the first three central moments, the expansion 
of sums that remains non-zero are 



E- = E- 
E • = E • 

Il,l2,l3 h=l2=h 

E ■ = E ■+ E ■+ E ■+ E ■ 

h,h,l3,U h=h7^^3=U h=h¥'h=U h=li¥^l2=l3 h=h=h=l4 

Applying these summations to the three moments leads to: 



IE, 



IE, 



IE, 



E^f 
E^f 



L I 



E^f 



L I 



lEx [{lx=ki — dkj {lx=k2 — ^^2)] 

lEx [{lx=ki — ^fci) (lx=fc2 — ^fca) (lx=fc3 — dk-i)] 

JEx [{lx=ki — dki) (lx=fc2 — ^^2) (lx=fc3 — dk^) {lx=k4 — ^fc4)] 
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+ Eo 
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iJEx [{lx=ki — dki) (lx=fc2 — ^^2)] ^x [{lx=k3 — 6ki) (lx=fc4 — ^^4)] 
lEx [{lx=ki — dki) (lx=fc3 — dhi)] lEx [{^-x=k2 ~ ^^2) (lx=fc4 — ^^4)] 
lEx [(lx=fci — ^fei) (lx=fc4 ~ ^^4)] lEx [(lx=fc2 — ^te) (lx=fc3 — ^fca)] 
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The Poisson-Dirichlet Process 



The expected sum of powers of q we solve for below. The expectation of X 
is for the multivariate discrete (or a multinomial with A^ = 1), so the val- 
ues are known for the various cases of ki,k2,.... For example, when ki ^ k2, 

Ex [{lx=ki — dki) i^x=k2 — ^fe)] = — ^fci^fc2- 

The expected sum of powers of q is obtained as follows. From first principles, it 
can be seen that 



lEg-[M] = IE,- l-(l-gfc) 



^N 



For N = 2 and rearranging terms we get 



1E,-[M2] = 2-lE^ 
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Applying Lemma 18 one gets the closed form expression for the left-hand side. 
Likewise, we get: 
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1 — a 

Y+b 

(l-a)(2-a) 

{l + b){2 + b) 

(l-a)(2-a)(3-a) 

{l + b){2 + b){3 + b) 

(l-a)(2-a)(3-a)(4-a) 

(l + 6)(2 + 6)(3 + 6)(4 + 6) 



Combining the resultant formula yields the cases in the lemma. 



